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Abstract 

The problem of principle component analysis (PCA) is traditionally solved by spectral 
or algebraic methods. We show how computing the leading principal component could be 
reduced to solving a small number of well-conditioned convex optimization problems. This 
gives rise to a new efficient method for PCA based on recent advances in stochastic methods 
for convex optimization. 

In particular we show that given a. d x d matrix X = ^ ^*^7 with top eigenvector 

u and top eigenvalue Ai it is possible to: 

• compute a unit vector w such that (w^u)^ > 1 — e in O -|- V) time, where 5 = 
Ai — A 2 and N is the total number of non-zero entries in Xi,..., x„, 

• compute a unit vector w such that w^Xw > Ai — e in Oidje^) time. 

To the best of our knowledge, these bounds are the fastest to date for a wide regime of 
parameters. These results could be further accelerated when d (in the first case) and e (in 
the second case) are smaller than \Jd/N. 


1 Introduction 

Since its introduction by Pearson [20] and Hotelling m, the principle component analysis 
technique of finding the subspace of largest variance in the data has become ubiquitous in 
unsupervised learning, feature generation and data visualization. 

For data given as a set of n vectors in xi,..., x„, denote by X the normalized covariance 
matrix X = ^ The PCA method finds the /c-dimensional subspace such that the 

projection of the data onto the subspace has largest variance. Formally, let W G be an 

orthogonal projection matrix, PCA could be formalized as the following optimization problem. 

max ||XW||F (PCA) 

,wrw=i 

where || • ||ir is the Frobenius norm. Note that the above optimization problem is an inherently 
non-convex optimization problem even for k = 1. Henceforth we focus on the problem of finding 
with high precision only the leading principal component, i.e. on the case k = 1. 

Finding the leading principal component could be carried out using matrix factorization 
techniques in time 0{nd^ + by explicitly computing the matrix X and then computing its 
singular value decomposition (SVD). However this requires super-linear time and potentially 
0{d^) space. 

Since super-linear times are prohibitive for large scale machine learning problems, approx¬ 
imation methods have been the focus of recent literature. Iterative eigenvector methods, such 
as the Lanczos method or Power method m, can be applied without explicitly computing the 
matrix X. These latter methods require the data to be well-conditioned, and the spectral gap. 
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i.e., the distance between largest and second largest eigenvalues of X, to be bounded away from 
zero. 

If we let (5 > 0 denote this spectral gap, then the Power and Lanczos methods requires 
roughly (Ai/5)“'>' passes over the entire data to compute the leading PC which amounts to 
0{{Xi/6)'^N) total time, where 7 = 1 for the Power method and 7 = 1/2 for the Lanczos 
method, and N it total number of non-zero entries in the vectors xi, 

Thus, iterative methods replace the expensive super-linear operations of matrix factorization 
by linear-time operations, but require several passes over the data. Depending on the spectral 
gap of X, the latter methods may also be prohibitively expensive. This motivates the study of 
methods that retain the best from both worlds: linear time per iteration, as well as doing only 
a small number (i.e., at most logarithmic in the natural parameters of the problem) of passes 
over the entire data. 

In the convex optimization world, such hybrid results with simple iterations coupled with 
only a few passes over the data were obtained in recent years [HIISIIIS]. Recently Shamir 
[MIES] made headway in the non-convex spectral world by suggesting a stochastic optimiza¬ 
tion method for the problem, that is based Oja’s randomized Power method |19l 0] with an 
application of the variance reduction principle demonstrated in [14j . 

Shamir’s algorithm runs in total time of 0{6~^d + N), assuming the availability of a unit 
vector w such that (w''~u)^ = D(l), called a “warm start”. Finding such a warm start vector w 
could take as much as dl^/X^5-^d) time using existing methods 0, hence the total running 
time becomes 0{-^Xi/SS~'^d + N). Quite remarkably, Shamir’s result separates the dependency 
on the gap 5 and the data size N. 

Analyzing the Power method in case of stochastic updates, as done in HEH, is much more 
intricate than analyzing stochastic gradient algorithms for convex optimization, and indeed both 
the analysis of Oja’s algorithm in m and Shamir’s algorithm in [241 [Mj are quite elaborate. 

In this paper we continue the search for efficient algorithms for PCA. Our main contribu¬ 
tion, at a high-level, is in showing that the problem of computing the largest eignevector of 
a positive semidefinite matrix could be reduced to solving only a poly-logarithmic number of 
well conditioned unconstrained smooth and strongly convex optimization problems. These well- 
conditioned convex optimization problems could then be solved by a variety of algorithms that 
are available in the convex and stochastic optimization literature, and in particular, the recent 
algorithmic advances in stochastic convex optimization, such as the results in naiHiiiaiiaiTi. 
As a result, we derive algorithms that in terms of running time, their worst case running time 
starting from a “cold-start” is equivalent or better than that of the algorthm of Shamir ini¬ 
tialized with a “warm-start” . 

1.1 Problem setting and main resnlts 

Assume we are given a set of n vectors in M'’*, xi, ...,x„, where n > d and let us denote by X 
the normalized covariance matrix X = Assume further that X has an eigengap 

of 5, i.e. Ai(X) — A 2 (X) = <5, and w.l.o.g. that ||xj|| < 1 for alH G [n]. Our goal is to find a 
unit vector w such that 


(w^u)^ >l — e, 

where u is the leading eigenvector of X and e is the desired accuracy parameter. 

^Finding a “warm start” could be carried out either by applying iterative methods such as Power and Lanczos 
methods to the entire data or by applying them only to a small random sample of the data. Since we want to be 
overall competitive with these methods, we focus on the second option. 
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In the rest of the paper we denote the eigenvalues of X in descending order by Ai > A 2 > 
... > Xd and by u = ui,U 2 ,...,Urf the corresponding eigenvectors. We denote by N the total 
number of non-zero entries in the vectors xi, ...,x„. 

We assume we are given an estimate 5 such that 

Cl (5 < 6 < C26, 

for some universal constants ci,C 2 which satisfy C 2 — ci = 0(1). Note that finding such a 5 
could be done in time that is logarithmic in 1/5. 

The main result of this paper is the proof of the following high-level proposition. 

Proposition 1. There exists an algorithm that after solving a poly-logarithmic number of well- 
conditioned unconstrained convex optimization problems, returns a unit vector w, such that 
(w’^u)2 > 1 - e. 

Based on this proposition we prove the following theorems. 

Theorem 1.1. Fix e > 0,p > 0. There exists an algorithm that finds with probability at least 
1 — p a unit vector w such that (w'''u)^ > 1 — e, m total time O -|- N) . 

Theorem 1.2. Fix e > 0,p > 0. Assume that 5 = o{y/d/N). There exists an algorithm that 
finds with probability at least 1 —p a unit vector w such that (w''~u)^ > 1 — e, in total time 

o{^\ 

Throughout this work we use the notation O(-) to hide poly-logarithmic dependencies on 
d, e-^p-^5-h 


Method 

Complexity 

Comments 

Power method [TO] 

Tv 


Lanczos [TO] 

v't'v 


VR-PCA [Ml |25| 

v'xF + 'V 


Theorem 11.11 


fastest when 5 = Tl{^Jd/N) 

Theorem 11.21 (5 = o{y^d/N)) 

Ar3/4^1/4 

_vS_ 

fastest when Ai = ijj{^/d/N), 5 = o{^/d/N) 


Table 1: Comparison with previous eigengap-based results. Note that the result of [Ml IM] 
apply in general only from a “warm-start”, i.e. when initialized with a unit vector wq such that 
(wq u)^ = 0(1). Finding such a warm-start could be carried out efficiently using a sample of 
roughly 1/5^ matrices Xjx)*" and applying the Lanczos method to this sample. 


1.1.1 Gap-free results 

It makes since to not only consider the problem of approximating the leading eigenvector of 
the matrix X, but also the problem of approximating the corresponding eignevalue, that is the 
problem of finding a unit vector w such that 

w^Xw > Ai — e. 

For this purpose we also prove the following theorems which are analogous to Theorems 11.11 

0 
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Theorem 1.3. Fix e > 0,p > 0. There exists an algorithm that finds with probability at least 
1 — p a unit vector w such that w''"Xw > Ai — e, in total time O (^). 

Theorem 1.4. Fix e = o{\/d/N),p > 0. There exists an algorithm that finds with probability 
at least 1 — p a unit vector w such that w"*^Xw > Ai — e, in total time O ■ 

We note that Theorem [la in which the bound is independent of n, applies also in the case 
when X is not given explicitly, but only through a stochastic oracle that when queried, returns 
a rank-one matrix XjX^ sampled at random from an unknown distribution P that satisfies 
X = Ex~d[xx^]. See section [6] for more details. 


Method 

Complexity 

Comments 

Power method [T7] 

f. 


Lanczos [T7j 

V ^ 


Sample PM 

Ai d 

~T7^ 


Sample Lanczos 

1 Xi d 

\l~l^ 


Theorem 11.31 

dje^ 

fastest when e = 11 ^max{ 

Theorem 11.41 (e = o{y/d/N)) 

_ ^ _ 

fastest when Ai = oj{y/dlN), e = o{y/d/N) 


Table 2; Comparison with previous eigengap-free results. Sample PM and Sample Lanczos refer 
to the application of the standard Power Method and Lanczos algorithms to a random sample 
of size roughly e“^, chosen uniformly at random from the matrices {xjx^}”^j^. Such a sample 
suffices to approximate the top eigenvalue to precision e (using standard concentration results 
for matrices such as the Matrix Hoeffding inequality m)- 


The rest of this paper is organized as follows. In Section [2] we give relevant preliminaries 
on classical methods for the computation of the largest eigenvector and related tools, and also 
preliminaries on convex optimization. In Section [3] we present the core idea for our algorithms: 
a shrinking inverse power method algorithm that computes the leading eigenvector after com¬ 
puting only a poly-logarithmic number of matrix-vector products. Based on this algorithm, 
in Section 0] we present our convex optimization-based eigenvector algorithm that requires to 
solve only a poly-logarithmic number of well-conditioned convex unconstrained optimization 
problems in order to compute the largest eigenvector. In Section [5] we combine the result of 
Section H] with recent fast stochastic methods for convex optimization to prove Theorems 11.11 
O In Section [6] we prove Theorems 11.3111.41 . 

2 Preliminaries 

2.1 Classical algorithms for the leading eigenvector problem 

2.1.1 The Power Method 

Our main algorithmic tool for proving the convergence of our method is the classical Power 
Method for approximating the leading eigenvalue and eigenvector of a positive definite matrix. 
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Algorithm 1 Power Method 


Input: a positive definite matrix M 
Let wq to be a random unit vector 
for t = 1,2,... do 

end for 


IIMwt 


In our analysis we will require the following theorem that gives guarantees on the approx¬ 
imation error of Algorithm [1] Both parts of the theorem follows essentially form the same 
analysis. Part one upper bounds the number of iterations of the algorithm required to achieve 
a crude approximation to the leading eigenvalue, and part two upper bounds the number of 
iterations required to achieve a high-precision approximation for the leading eigenvector. 


Theorem 2.1. Let M be a positive definite matrix and denote its eigenvalues in descending 
order by Ai, A 2 ,..., and let ui,U 2 ,...,Urf denote the corresponding eigenvectors. Denote 5 = 


Ai — A 2 and k = H-. Fix an error tolerance e > 0 and failure probability p > 0. Define: 


5 

pPM 
' crude 


1 


(e,p) = r-ln ^ 


18d 


p2g 


-nFM 


{K,e,p) = [^In ( -^ 


Then, with probability 1 — p it holds that 

1. (crude regime) Vt > w^Mw* > (1 — e)Ai. 

2. (accurate regime) Vt > T^^{K,e,p): (w^ui)^ > 1 — e. 

In both cases, the success probability depends only on the random variable (w([ui)^. 
A proof is given in the appendix for completeness. 


2.1.2 The Inverse Power Method and Conditioning 

As seen in Theorem EH for a given matrix X, the convergence rate of the Power Method 
algorithm is strongly connected with the condition number «:(X) = which can be quite 

large. Consider now the following matrix 

:= (AI-X)"^ 


where A is a parameter. 

Note that if X is positive semidefinite and A > Ai(X), then is positive definite. Further¬ 
more, the eigenvectors of are the same as those of X and its eigenvalue are (in descending 
order) Ai(M“^) = Thus, if our goal is to compute the leading eigenvector of X we 

might as well compute the leading eigenvector of M“^. This is also known as the Inverse Method 

m- 


The following lemma shows that with a careful choice for the parameter A, we can make the 


condition number k(M 


to be much smaller than that of the original matrix X. 


Lemma 2.1 (Inverse Conditioning). Fix a scalar a > 0. Let M ^ = (AI — X) ^ such that 
Ai(X) + a5(X) > A > Ai(X). It holds that 


Ai(M-i) 

(I(M-i) 


< 1 + a. 
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Proof. We denote by Ai,A2,(5 the values Ai(X),A 2 (X) and 5(X) respectively. 
It holds that 

Thus we have that 
Ai(M-i) 


and the lemma follows. □ 

Of course, a problem with the above suggested approach is that we don’t know how to 
set the parameter A to be close enough to Ai(X). The following simple lemma shows that by 
approximating the largest eignevalue of (AI —X)“^, we can derive bounds on the suboptimality 
gap A — Ai(X), which in turn can be used to better tune A. 

Lemma 2.2. Fix a positive semidefinite matrix X and denote the largest eigenvalue by Ai. Let 
A > Ai and consider a unit vector w such that 

(AI - X)"^ w > (1 - e)Ai ((AI - X)"^) , 

for some e G (0,1]. 

Denote A = _i —. Then it holds that 

w ' (AI—X) w 

(1 — £)(A — Ai) ^ A < A — Ai. 

Proof. According to our assumption on w it holds that 

Ai ((AI - X)-i) > (AI - X)~^ w > (1 - e)Ai ((AI - X)"^) . 

Thus by our definition of A and the fact that Ai (AI — X)“^) = it holds that 

(l-e)(A-Ai) < A< A-Ai, 


1 


A-Ai 
1 


1 


1 


A - Ai V A - A 


A-Ai 
1 


A — A 2 
1 


-1 




A-Ai +(5 


-1 


1 


A-Ai V(A-Ai)(A-Ai + <5) 


A — Ai + (5 a6 

-r-< 1 + — = 1 + a, 


and the lemma follows. 

□ 

Thus, if we set for instance e = 1/2, and then define A' = A — A, then by Lemma 12.21 we 
have that A' — Ai < 

2.2 Matrix Inversion via Convex Optimization 

2.2.1 Smoothness and strong convexity of continuous functions 

Definition 2.1 (smooth function). We say that a function f : M. is ft smooth if for all 

X, y G it holds that 


||V/(x)-V/(y)||</3||x-y 
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Definition 2.2 (strongly convex function). We say that a function / : M is a-strongly 

convex if for all x, y G it holds that 

/(y) > /(x) + V/(x)T(y _ x) + |||x - yf. 

The above definition combined with first order optimality conditions imply that for an ol- 
strongly convex function /, if x* = argmin^giRd /(x), then for any x G it holds that 

/(x)-/(x*)>|||x-x*f. (1) 

Lemma 2.3 (smoothness and strong convexity of quadratic functions). Consider the function 

/(x) = ix’^Mx + b^x, 

where M G is symmetric and b G // M ^ 0 then /(x) is XifM.)-smooth and 

Xd(M.)-strongly convex, where Xi,Xd denote the largest and smallest eigenvalues o/M respec¬ 
tively. Otherwise, /(x) is ||M||-smoof/i. 


2.2.2 Matrix inversion via convex optimization 

In order to apply the Inverse Method discussed in the previous subsection, we need to apply 
Power Method steps (Algorithm [1]) to the matrix (AI — X)^^. Denote M = AI — X. Thus on 
each iteration of the Inverse Method we need to compute a matrix-vector product of the form 
M~^w. This requires in general to solve a linear system of equations. However, it could be also 
approximated arbitrarily well using convex optimization. Consider the following optimization 
problem: 

min{F(z) := -z'^^Mz — w^z}. (2) 

zSR*^ 2 

By the first order optimality condition, we have that an optimal solution for Problem ([2]) - 
z* satisfies that VF{z*) = Mz* — w = 0, meaning, 

z* = M^^w. 


Note that under the assumption that A > A(X) (as stated Subsection I2.1.2p it holds that 
M is positive definite and hence invertible. 

Most importantly, note that under this assumption on A it further follows from Lemma 12.31 
that F{z) is Ad(M) = (A — Ai(X))-strongly convex and Ai(M) = (A — Arf(X))-smooth and thus 
could be minimized very efficiently via algorithms for convex minimization. 

Since by using algorithms for convex minimization we can only find an approximated- 
minimizer of F{z), we must discuss the effect of the approximation error on the convergence of 
the proposed algorithms. As it will turn out, the approximation error that we will care about it 
the distance ||z — z*\\ where z is an approximated minimizer of F. The following lemma, which 
follows directly from the strong convexity of F and Eq. m, ties between the approximation 
error of a point z with respect to the function F and the distance to the optimal solution z*. 

Lemma 2.4. Given a positive semidefinite matrix X, a vector w, a scalar X such that X > Ai(X) 
and an error tolerance e, let M = AI — X, and denote by z* the minimizer of F[z) - as defined 
in Eq. m- Then, for any z it holds that 


z — z 


< 


i 2{F{z)-F{z*)) 
A-Ai(X) ■ 
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2.2.3 Fast stochastic gradient methods for smooth and strongly convex optimiza¬ 
tion 

In this subsection we briefly survey recent developments in stochastic optimization algorithms 
for convex optimization which we leverage in our analysis in order to get the fast rates for PC A. 
Consider an optimization problem of the following form 

1 ” 

min{F(z) :=-^/i(z)} (P) 

z n 


where we assume that each /j is convex and /3-smooth and that the sum -F(z) is cr-strongly 
convex. 

We are going to show that the PCA problem could be reduced to solving a series of convex 
optimization problems that takes the form of Problem (jP]) (this is actually not a precise state¬ 
ment since in our case each function /j won’t be convex on its own and we will need to address 
this issue). Thus, we are interested in fast algorithms for solving Problem (fPl) . 

The standard gradient descent method can solve Problem m toe precision in O ( (/3 /cj) log (1/e)) 
iterations were each iteration requires to compute the gradient of F[z). Thus the overall time 
becomes O ( ^Tq j where we denote by Tq the time to evaluate the gradient direction of F. 


of any of the functions /j. The dependence on the condition number ^ could be dramatically 
improved without increasing significantly the per-iteration complexity, by using Nesterov’s ac¬ 
celerated method that requires O ( y/S/a log(l/e)) iterations, and overall time of O ( y^/3 /aTc 


However, both methods could be quite computationally expensive when both ^ and Tg are 


large. 

Another alternative is to use stochastic gradient descent, which on each iteration t, performs 
a gradient improvement step based on a single function /j^ where it is chosen uniformly at 
random from [n]. This single random gradient serves as an estimator for the full gradient. The 
benefit of this method is that each iteration is extremely cheap - only requires to evaluate the 
gradient of a single function. However the convergence rate of this method is roughly l/(cje) 
which is ineffective when e is very small. The intuitive reason for the slow convergence is the 
large variance of the gradient estimator. 

Each of the methods mentioned above, the deterministic gradient descent, and the stochastic 
one has its own caveats. The deterministic gradient method does not consider the special 
structure of Problem ([P]) which is given by a sum of functions, and on the other hand, the 
stochastic gradient method does not exploit the fact that the sum of functions is finite. 

Recently, a new family of stochastic gradient descent-based methods was devised, which is 
tailored to tackling Problem © ISDEalEllIHlttaEaiT!. Roughly speaking, these methods 
apply cheap stochastic gradient descent update steps, but use the fact the the objective function 
is given in the form a finite sum, to construct a gradient estimator with reduced variance. 

For instance, the SVRG algorithm presented in [TTj, requires 0(log(l/e)) iterations to 
reach e accuracy, where each iteration requires computing a single full gradient of F{z) and 
roughly 0{/3/a) cheap stochastic gradient updates. Thus the total running time becomes 
O where Tg denotes the worst case time to evaluate the gradient of 

a single function /j. 

The following theorem summarizes the application of SVRG to solving Problem (©. For 
details see M- 
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Theorem 2.2 (Convergnec of SVRG). Fix e > 0. Assume each /j(z) is [5-smooth and F{z) is 
a-strongly convex. Then the SVRG Algorithm detailed in m finds in total time 

+ log(l/e)) 

a vector z G such that 


E[F(z)] — min F{z) < e. 

In recent works [zmi], it was demonstrated how methods such as the SVRG algorithm could 
be further accelerated by improving the dependence on the condition number (5/a. 

3 The Basic Approach: a Shrinking Inverse Power Algorithm 

The goal of this section is to present a Power Method-based algorithm that requires to compute 
an overall poly-logarithmic number of matrix-vector products in order to find an e approxi¬ 
mation for the leading eigenvector of a given positive semidefinite matrix X. 

Algorithm 2 Shrinking Inverse Power Method 

I: Input: matrix X G such that X ^ 0, Ai(X) < 1, an estimate 6 for (5(X), accuracy 

parameter e G (0,1) 

2: A(o) — 1 -|- 5 

3: S i — 0 

4: repeat 

5: S S -I- 1 

6: Let IVI^ = — X) 

7: Apply the Power Method (Algorithm [1]) to the matrix to find a unit vector w* 

such that 

wJMJ^Ws > iAi(M7^) 


k ' Ta/t-I - 

^ wj Ms Ws 

9: A(s) ^ ^(s-I) “ %■ 

10: until As < 5 
II: A(/) ^ A(^) 

12: Let IVIj = (A^j^I — X) 

13: Apply the Power Method (Algorithm [T]) to the matrix to find a unit vector wj 

such that 

(wju)^ > 1 — e 


14: return wj 

We prove the following theorem. 

Theorem 3.1. Assume 5 satisfies that 6 G [I 525 ]. There exists an implementation for Algo¬ 
rithm d that requires computing at most 

O (^og{d/p) log((5"^) log ^ = 0(1) 
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matrix-vector products of the form M“^w, where M is one of the matrices computed during the 
run of the algorithm or'M.j) and w is some vector, such that with probability at least 1—p 
it holds that the output of the algorithm, the vector Wf, satisfies: 

(wju)^ > 1 — e. 

In order to prove Theorem 13.II we need a few simple Lemmas. 

First, it is important that throughout the run of Algorithm [2l the matrices and the 
matrix Mj will be positive definite (and as a results so are their inverses). The following lemma 
shows that this is indeed the case. 

Lemma 3.1. For all s >0 it holds that > Ai. 

Proof. The proof is by a simple induction. The claim clearly holds for s = 0 since by our 
assumption Ai < 1. Suppose now that the claim holds for some iteration s. According to 
Lemma 12.21 it holds that the value A^+i computed on iteration s + 1 satisfies that 


As+i < A(s) ~ ^1- 


Hence, according to the algorithm, it holds that 


■^(s + l) -^(s) r. — -^(s) 


^{s) - 


-^( s ) + -^1 


> Ai, 


where the last inequality follows from the induction hypothesis. 


□ 


The following lemma bounds the number of iterations of the loop in Algorithm [2j 

Lemma 3.2. The repeat-until loop is executed at most 0(log(5“^)) times. 

Proof. Fix an iteration s of the loop. By applying Lemma 12.21 with respect to the unit vector 
Wg we have that 

As > -(A(s_i) - Ai). 

By the update rule of the algorithm it follows that 

A(s) - Ai = A(s_i)—^ - Ai < (A(s_i) - Ai) --(A(s_i) - Ai) 

= ^(A(s-i) - Ai). (3) 

Thus, after at most T = [log 3/4 ^ A(o/-Ai )~ l ~ (using our choice of A(o)) itera¬ 

tions, we arrive at a value A(j’) which satisfies ~ Ai < 5. By Lemma 12.21 it follows that 
in the following iteration it will hold that Ay+i < 5 and the loop will end. Hence, the overall 
number of iterations is at most T + 1 = 0(log(5“^)). □ 


Finally, the following lemma gives approximation guarantees on the estimate A^yp 

Lemma 3.3. Suppose that all executions of the Power Method in Algorithm are successful. 
Then it holds that 


^ 35 , , (5 

Ai + — > A(/) > Ai + -. 


( 4 ) 
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Proof. Denote hy Sf the last iteration of the loop in Algorithm [51 and note that using this 
notation we have that A(j) = ^(sf) and that < 5. Using Lemma 12.21 we thus have that 


\f) - -^1 - \sf-l) 



Ai < 2Asf 



Asf < 



which gives the first part of the lemma. 

For the second part, using Lemma 12.21 again, we have that 


\f) - -^1 


- 2 -“ 2^\sf-i) - Ai) 


(5) 


In case Sf = 1, then by our choice of A(g) we have that A(o) — Ai > S, and the lemma follows. 
Otherwise, by unfolding Eq. ([5]) one more time, we have that 


\f) 




Ai) > 


4 



where the second inequality follows from Lemma 12.21 and the last inequality follows from the 
stopping condition of the loop. □ 


We can now prove Theorem 13.11 


Proof. First a note regarding the success probability of the invocations of the Power Method 
algorithm in Algorithm |2J since, as stated in Theorem 12.11 the success of the PM algorithm 
depends only on the magnitude of (wj^u)^, and all matrices Ms,Mj in Algorithm [2] have the 
same leading eigenvector, as long as all invocations use the same random initial vector and 
number of steps that guarantees success with probability 1 — p, they all succeed together with 
probability at least 1 — p. 

Let us now assume that all executions of the Power Method algorithm in Algorithm [1] were 
successful. 

According to Lemma [3.21 the loop is executed at most 0(log(j“^)) = 0(log((5“^)) times 
(following our assumption on 6). Each iteration s of the loop requires to invoke the Power 
Method to approximate Ai(M) up to a factor of 1/2 which according to Theorem 12.11 requires 
computing Pcrnde(l/^>P) ~ 0{\og{d/p)) matrix-vector products, in order to succeed with prob¬ 
ability at least 1 — p. Thus the overall number of matrix-vector products computed during the 
loop is O (log(fi/p) log((5“^)). 

According to lemma [TS] it holds that A(j) — Ai < < 35. Thus, according to Lemma [2Tl 

we have that 




Ai(M/) 


< 4 = 0(1). 


Thus, in the final invocation of the PM algorithm, it requires at most r^^(4, e,p) = 
O ^log matrix-vector products to compute wj as desired, with probability at least 1—p. 

Thus the overall number of matrix-vector products is 


O (^og{d/p) log((5 log • 


□ 
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4 A Convex Optimization-based Eigenvector Algorithm 

In this section we present our algorithm for approximating the largest eigenvector of a given 
matrix based on convex optimization. The algorithm is based on Algorithm [2] from the previous 
section, but replaces explicit computation of products between vectors and inverted matrices, 
with solving convex optimization problems, as detailed in Subsection 12.21 

Towards this end, we assume that we are given access to an algorithm - A for solving 
problems of the following structure: 

min{Fw,A(z) := ■^z’^(AI - X)z - w’^z}, (6) 

zeR*^ 2 

where X is positive definite, A > Ai(X) and w is some vector. Note that under these conditions, 
the function T^v,A(z) is strongly convex. Note also that the minimizer of T\v.a(z) - z* is given by 
z* = (AI — X)^^w, and thus solving Problem ([6|) is equivalent to computing a product between 
a vector and an inverse matrix. 

There are a few issues with our approach that require delicate care however: 1) we need to 
pay close attention that the convex optimization problems are well-conditioned and 2) since now 
we use a numerical procedure to compute the matrix-vector products, we have approximation 
errors that we need to consider. 


Algorithm 3 Leading Eigenvector via Convex Optimization 


1: Input: matrix X G such that X ^ 0, Ai(X) < 1, an estimate S for <5(X), accuracy 

parameter e G (0,1), failure probability parameters p 


2: A 


(0) 


1 + 5 


3: Set: m-i 
4: Set: e G- 


5 

6 

7 

8 
9 

10 

11 

12 

13: 

14; 

15: 


^TfZea/8,P), 

minU 


16 \ 8 


^ , _ t^PM 

^2 ^ J^acc 
- \ m2+l 


(3, e/2, p) 


>418^ 


} 


Let wq be a random unit vector 
s ^ 0 
repeat 
s s + 1 

Let Ms = (A(s-i)I ~ 

for t = l...mi do 

Apply Algorithm A to find a vector wt such that ||wt — 

end for 


Ws 


Apply Algorithm A to find a vector such that 

As ■<— 


-M: 


Ws 


Wt_i|| < e 


< e 


w ' Vs— e 


16: A(s) ^ ^(s-1) - %■ 

17: until As < 5 

18 : \(f) •«— A(s) 

19: Let IVl^ = — X) 

20: for t = l...m 2 do 

21: Apply Algorithm A to find a vector such that ||wf — Mj^wt_i|| < e 

22: end for 

23: return Wf ■(— u 

J Wmo 
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4.1 Power Method with inaccurate updates 

In this section we analyze the potential convergence of the Power Method with inaccurate 
matrix-vector products. 

Lemma 4.1 (Power Method with inaccurate matrix-vector products). Let M be a positive 
definite matrix with largest eigenvalue Ai and smallest eigenvalue Xd- Fix an accuracy param¬ 
eter e > 0. Let w be an arbitrary unit vector and consider the following sequences of vectors 

defined as follows: 


Wn = Wn = W, 


Vt > 1 ; wl 


Mw 


Wi 




w+ 


w; 


Wo = wq = w, Vt > 1 : w^ satisfies that ||wt — Mwt_i|| < e, w^ 


Wf 

|wd 


Denote: 


r(M,t) := 


A* 


Ai-l 


if Xi = 1 

if Xifil 


r(M,t) := 


t i/ Ai = 1 


Then it holds that for all t > 0, 


and 


Proof. First, observe that: 


|wt — wj' II < e • f (M, t) 
|wt — wj' II < e • r(M, t) 


||wt+i — Wj_,_]^|| = ||wt+i — Mwt-|-Mwt — Wj_,_]^|| < ||wt+i — MwtII-|-||Mwi — Mw^l 

< e -|- Ai • ||wt — w^||. 

In case Ai = I, we clearly have that 

||wt+i - Wj%^|| < (t + I)e -t ||wo - WqII = (t + l)e. 

Otherwise, in case Ai 7 ^ I, by a simple algebraic manipulation we have that 

e , . eAi 


(7) 


|wt+i - w^+ill -h 


Ai — 1 

It thus follows that for all t > 0, 


< Ai • ||wi - w!|| 


Ai-I 


— Ai 


wt - wJ -h 


Ai — I 


< A‘+' (||w„ - *511 + ‘ - 1)' 


\\Wt+l - W^+i 

We now have that for all t > 1 it holds that 

„ Wt wt ,, 


( 8 ) 


|Wt - Wj II = 


< 


|Wt|| 

Wt 


W 


W 


+ 


W 


WT 


Wt 

|wt| 

Wt 

|wt| 


Wt Wt 


Wj 


w; 


Wt 


WT 


WT 


WT 


WT 


WT 


|wt - Wtll + l|wt 

|wt - Wtll 


1 


l|Wt|| 
Wtll - ||wt| 


WT 


< 


2 ||wt - w^ 


w: 


WT 
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where the last inequality follows from the triangle inequality. 

Now, since Wj = M*w and M ^ A^I, it follows that 

II Will > ||w|| =X\. 

Combining this with Eq. Q and ([7]), we have that for all t > 1, 


|wt — wj'll < 


2e 


t 


Ai-l 


if Ai = 1; 
if Ai / 1. 


Thus the lemma follows. 


□ 


The following corollary is a consequence of the convergence result for the Power Method 
(Theorem [2T]) and Lemma l4.11 

Theorem 4.1 (Convergence of the Power Method with inaccurate updates). Fix e > 0 and 
p > 0. Let M 6e a positive definite matrix with largest eigevalue Ai and eigengap 5 > 0. 
Consider a sequence of veetors where ■wq is a random unit vector, and for all t G [r] it 

holds that ||wt — M-Wi_i|| < e := r) Lemma^J^for definition o/r(M, r)J. Then the 
following two guarantees hold: 

1. If T > Tcr^de(^/‘^’P) i^en with probability at least 1 — p it holds that 

wJMwt- > (1 — e)Ai. 


2. If T > T^^(k(M), e/2,p)t/ien with probability at least 1 — p it holds that 

(w>i )2 > 1 -e, 

where ui is the largest eigenvector o/M. 

In both cases, the probability of success depends only on the random variable (wqUi)^. 

Proof. Let {w^j^g be a sequence of unit vectors as defined in Lemma l4.ll For the first item, 
note that 

Since M is positive definite, the function /(w) = w'^^Mw is convex and we have that 

IwJMwt-— w*'''Mw*I < 2 max{(MwT-)'^(-WT-—-w*), (M-w*)'''(-w* —-Wt)} 

< 2Ai(M) • ||wt-— w* ||. 

Thus we have that 

■wJMwt > w*'''Mw* — 2Ai(M) • II-Wt-—- w* II 
> w;Tmw;-|ai(m), 

where the last inequality follows from Lemma 14.11 and our choice of e. 

Now, by our choice of r and Theorem 12.11 we have that with probability at least 1 — p it 
holds that 

wjMwr > (1-e/2)Ai(M) - |Ai(M) = (1-e)Ai(M). 
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For the second item, note that 


w. 


Ui)^ 


= (w^ 

> 


*T 


ui + (w^ - > 


' *T \2 
Ui) 


— 2 ||wt — w* 


*~r \ 2 

Ui) 


e 

2 ’ 


where the last inequality follows again from Lemma 14.11 and our choice of e. 

Again, by our definition of r and Theorem 12.11 we have that with probability at least 1 — p 
it holds that 


(wjui)^ > 1 — e/2 — e/2 = 1 — e. 


□ 


4.2 Convergence of Algorithm [3] 

The key step in the analysis of Algorithm [3] is to show that if all numerical computations are 
carried out with sufficiently small error, then Algorithm [3] successfully simulates the Power 
Method-based Algorithm [ 2 l A main ingredient in Algorithm [2] is the computations of the values 
Ag which are used in turn to update the estimates A(s). Our use of the values in Algorithm 
[ 2 ] was through the approximation guarantees they provided for the gap — Ai (recall we 

have seen that ^(A(s_i) — Ai) < < A(s_i) — Ai). The following lemma shows that with a 

correct choice for the numerical error parameter e, these guarantees also hold for the values A^ 
computed in Algorithm [3l 

rA SAT 1 

Lemma 4.2. Suppose that 6 € [ 2 ’tJ ^ — Te ( 8 ) ’ where mi is as defined in 

Algorithm 0 Then with probability at least 1 — p it holds that for any s, the update of the 
variables As,A( 5 ) in Algorithmic is equivalent to that in Algorithmic In particular, for all 
s > 1 it holds that 

“ -^ 1 ) < “ ^ 1 - 


Proof. For clarity we refer by A^, A(s) to the values computed in Algorithm [3] and by A^, A(s) to 
the corresponding values computed in Algorithm [2] from Section [3l The proof of the lemma is 
by induction on s. For the base case s = 0 , clearly A(o) = A(o) and both Aq, Aq are undefined. 
Consider now some s > 1. 

Suppose for now that 


e < min{- 


Xi{Mj^) 


}• 


16F(Mr\mi)’ 8 

Then it follows from Theorem 10 and our choice of mi that with probability at least 1 — p, 


(9) 


wJm^ > ^Ai(M^ ^). 


( 10 ) 


By the definition of the vector in Algorithm [3] and the Cauchy-Schwartz inequality it 
holds that 


wjvg = wJm^ Vs -h wJ(v^ - Ms Vs) G [wJm^ Vs - e,wjMs Vs e]. (11) 

Thus, combining Eq. (uni) and m, we have that 

wjvs - e G [avJm^Vs - 2e,wjM7Vs] C [3Ai(M7^)/4 - 2e, Ai(M7^)]. 
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By our choice of e it follows that 

wJv,-6€[Ai(M7i)/2,Ai(M7i)]. 


Thus, the computation of the value of A^, and as a result the computation of A(s), is identical 
to that of As, A(s), and the claim follows. 

It only remains to give an explicit bound on e, as defined in Eq. ([9]). For this we need to 
upper bound r(M7^,mi) for all values of s. Recall that from the results of Section[3]it follows 
that the values A(s) are monotonically non-increasing (see Eq. ([3|) in proof of Lemma 13.21) and 
lower-bounded by A(j) > Ai -|- 5/4 fLemma 13.31) . By our assumption of 5 we have that 


Ai(M 7 ^) = 


■5(s-i) — Ai 


> 


A(o) - Ai 


1 + 5-Ai 


> 


1 + ¥-5 


5 5 

> ! + ->! + -, 


( 12 ) 


where the second inequality holds since by definition of 5 it follows that Ai = A 2 + 5 > 5. 
Fix a natural number t. By the definition of r(M,t) (see Lemma l4.1|) we have that 


T{M-\t) = 


< 


< 


2 


XdiMjy 

-1 

5 ^(s-i) 

5 ' 

K\s-i) 

-1 

y+ 5 ' 

5 ^ 



■ Ai(Ms-^) - 1 “ Ai(Ms-i) - 1 


< 6 / ^( 0 ) y 

- Ai/ ~ S \\f) - Ai/ 



\Xd{M7^)J 


(13) 


where the first inequality follows from Eq. the second inequality follows from the bounds 
Ad > 0 and A(j) < A(s_i) < A(o), and the third inequality follows from plugging the value of A(o) 
and from Lemma 13.31 


Thus, it follows that we can take e < minji ( | 


mi+l 






mi+1 


□ 


Theorem 4.2 (Convergence of Algorithm [3|) . Suppose that 5 E [5/2,35/4]. Fix e > 0 . Let 
'mi > "^2 > T™{3,e/2,p). Suppose that 1, satisfies that 


e < mini — 
^16 



e 

4 



m 2 +l 

}• 


Then, with probability at least 1— p it holds that the output of Algorithmic the unit vector Wf, 
satisfies that 


(wjui)2 > 1 - e, 

and the total number of calls to the convex optimization oracle A is 

O ^log((i/p)log(5"^) + log(^)^ . 


16 














Proof. First, note that as in the proof of Theorem 13.11 since all the noisy simulations of the 
Power Method in Algorithm [3] are initialized with the same random unit vector wq, they all 
succeed together with probability at least 1— p (provided that the other parameters 
are set correctly). 

By our choice of e and Lemma 14.21 it follows that we can invoke the results of Section [3l 
By Lemma 13.21 we can upper bound the number of iterations made by the repeat-until loop by 
0(log(5“^). Since each iteration of the loop requires mi + 1 calls to the optimization oracle A, 
the overall number of calls to A during the loop is 0(mi log(5“^)). 

By Lemma 13.31 we have that the final estimate satisfies that A(j-) — Ai < ^ < ^ < 25. 
Thus, by Lemma l2.II we have that 


1 Ai(M7i 
k(M7^) = ^ 


5{MJ^^ ~ 


< 3. 


(14) 


Suppose now that e < 


4r(M-Lm2) 


. By our choice of m 2 and Eq. dm), it follows from Theorem 


14.11 that with probability at least 1 — p indeed (wj u )^ > 1 - e. 

The number of calls to the oracle A in this final part is m 2 . Thus overall number of calls to 
A in Algorithm [3] is 0(mi log(J“^) + m 2 ). 

It remains to lower-bound the term 4 p(]y[-i ^ ^ • Following the analysis in the proof of 

m2) < (1) 


Lemma 02] (Eq. (fT3|) L we can upper bound r(Mj 
bound on e stated in the theorem. 


, which agrees with the 

□ 


In order to analyze the arithmetic complexity of Algorithm [3| using a specific implementation 
for the optimization oracle A, it is not only important to bound the number of calls to A (as 
done in Theorem l4.2l) . but to also bound important parameters of the optimization problem ([6|) 
that naturally arise when considering the arithmetic complexity of different implementations 
for A. For this issue we have the following lemma which follows directly from the discussion in 
Section [3| and the assumptions stated in Theorem 14.21 

Lemma 4.3. Let A, w be sueh that during the run of Algorithmic the optimization oraele A is 
applied to the minimization of the function 

^w,a(z) = 

Then, under the conditions stated in Theorem If. 21 it holds that 
1- Ew,a(z) is (A — Ai(X)) = Q(5)-strongly convex. 

2. for all i G [n] it holds that the function fi{z) = ^z"'"(AI — XjX^)z — w^z is 1 5 = 0(1)- 

smooth. 

3. log(||z*||) = 0(1), where z* is the global minimizer o/Ew,a(z). 


5 Putting it all together: Fast PCA via SVRG 

In this section we prove Theorems 11.1111.21 

Following the convex optimization-based eigenvector algorithm presented in the previous 
section - Algorithm [S] we consider the implementation of the convex optimization oracle A, 
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invoked in Algorithm El using the SVRG algorithm [T3] discussed in subsection I2.2.31 Recall 
that the oracle A is used to solve Problem ([6]) for some parameters A, w. Indeed the objective 
Fvv,a(z) in Q could be written as a finite some of functions in the following way: 

Further, recall that for A > Ai(X), Fw,a(z) is always (A — Ai(X))-strongly-convex and that for 
every i £ [n], the function 

fi{z) := iz’^(AI-Xix7)z-w’^z, 

is max{A, ||xj|p — A} smooth. However, /j(z) need not be convex. Hence the SVRG theorem 
from m could not be directly applied to minimizing (I15p . However, we prove in the appendix 
that the SVRG method still converges but with a slightly worse dependence on the condition 
number [^. 



Below we give an 

explicit implementation of the the SVRG algorithm for minimizing (|15p. 

Algorithm 4 SVRG FOR PGA 

1 

Input: A G M, X - 

1 T rri 

= n Si=iXiX.' ,w,r],m,T. 

2 

zo 0 


3 

for s = 1, ...T do 

4 

z Zs_i 


5 

X)z 

- Wt_i 

6 

Zo f- z 


7 

for t = 1, 2,..., 

m do 

8 

Randomly pick G [n] 

9 

Zt £- Zt-l - 

rj ((AI - Xj,xT)(zt_i - z) + /2) 

10 

end for 


11 

~ J_ ^pm-l 

Zt 

12 

end for 


13 

return zt 



The following theorems are proven in the appendix. 

Theorem 5.1. Fix e > 0,p > 0. There exists a choice ofr],m such that Algorithm^finds with 
probability at least 1 — p an e-approxiated minimizer of (USD in overall time 

°("+(A-Ai(X))2)' 

Based on the recent acceleration framework of [12] we also have the following result (the 
proof is given in the appendix). 

Theorem 5.2. Fix e > 0,p > 0. Assume that A — Ai = 0{^d/N). There exists an accelerated 
version of Algorithm^ that finds with probability at least 1 — p an e-approximated minimizer of 
dEI) in overall time 


6 


jY3/4^1/4 \ 

v/A-Ai(X)J ■ 


^We note that in [2^ it was shown that the same kind of result holds also for the SDCA method that was 
originally introduced in |23| . 
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5.1 Proving Theorems II. II 11.21 

The proof of Theorem 11.11 follows from the bounds in Theorem 14.21 Lemma 14.31 and Theorem 

O 

The proof of Theorem II.21 follows from the bounds in Theorem l4.2l Lemma [4.3l and Theorem 

O 


6 Proof of Theorems II.3111.41 

In this section we prove Theorems 11.31 and 11.41 Since the algorithm and corresponding proofs 
basically follow from the results of the previous sections, we give the algorithm and a sketch of 
the proofs. 

The algorithm is the same as Algorithm [3l but intuitively, it replaces the estimate for the 
eigengap 5 with the desired approximation error for the top eigenvalue, e. This is intuitive since 
now, instead of finding a unit vector that is approximately entirely contained in the span of 
vectors {uj | > Ai — (5} = {ui}, we wish to find a unit vector that is approximately entirely 

contained in the span of vectors {uj | A^ > Ai — e}. 


Algorithm 5 Leading Eigenvalue Approximation via Convex Optimization 
1: Input: matrix X G such that X ^ 0, Ai(X) < 1, accuracy parameter e G (0,1), failure 

probability parameters p, positive integers mi, m 2 , numerical accuracy parameter e 


2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 


^(0) ^ 1 + e 

Let wq be a random unit vector 
s ^ 0 
repeat 
s s + 1 

Let Ms = (A(s-i)I ~ 

for t = 1.. .772-1 do 

Apply Algorithm A to find a vector w* such that ||wt — 

end for 


Ws 


Apply Algorithm A to find a vector v^ such that 
As 


-M: 


Wo 


Wi_i|| < e 


< e 


\s) 


1 

2 

- A 


wJ Vs—e 


(s-1) 


As 

2 


until As < e 


A 


(/) 


A 


(s) 


Let M/ = (A(j)I-X) 

for t = 1...7772 do 

Apply Algorithm A to find a vector such that ||wi — Mj^wi_i|| < e 

end for 

return wt- ■(— << 


The following simple lemma is of the same flavor as Lemma 12.11 and shows that we can 
benefit from conditioning the inverse matrix (AI — X)“^, even if the goal is only to approximate 
the leading eigenvalue and not the leading eigenvector. 

Lemma 6.1. Fix e > 0 and a scalar a > 0. Let = (AI — X)~^ such that Ai(X) + a • e > 
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A > Ai(X). It holds for all i G [d] such that Aj(X) < Ai(X) — e, that 


Proof. It holds that 


Ai(M-i) 

Ai(M-i) 


Ai(M-^) 

A,(M-i) 


> 1 + a 


A-Ai 


(A - Ai) 


1 + 


Ai — Aj 

A-Ai 


> 1 + 


Ai — Aj 

ae 


Thus, for any i G [d] such that Ai — A* > e it follows that > 1 + a □ 

In order to prove the convergence of Algorithm [5] we are going to need a slightly different 
result for the Power Method (Algorithm [T]) than that of Theorem 12.11 This result follows from 
the same analysis as Theorem 12.11 For a proof see the proof of Theorem 12.11 


Lemma 6.2 (Convergence of Power Method to span of top eigenvectors). Let M be a pos¬ 
itive definite matrix and denote its eigenvalues in descending order by AI, A 2 ,..., A^, and let 
ui,U 2 ,...,Urf denote the corresponding eigenvectors. Fix €i ,€2 G (0,1) and failure probability 
p > 0. Define: 

T^^{€i,e2,p) := r^ln 

2ei VP ^ 2 / 

Then, with probability 1 — p it holds for any t > T^'^(ei, 62 ,p) that 

(w7ui)^<e2. 

Aj<(l—ei)Ai 


The probability of suecess depends only on the random variable (W([ui)^. 

Theorem 6.1. There exists a choice for the parameters mi,m 2 ,e such that Algorithmic when 
implemented with SVRG as the optimization oracle A, finds in time O (-4 + dV) a unit vector 
•Wf that with probability at least 1— p satisfies, 

wJXwj > Ai — e. 

Proof sketch. Following the same lines of the analysis presented in Theorem 14.21 there exists 
mi = 0(1) and e satisfying log(l/e) = 0(1), such that the repeat-until loop terminates after 
0(1) iterations with a value A(j) such that |e > A(j) — Ai > |e. 

Denote S{e) = {i G [d] | Aj > Ai — e/2}. By Lemma IHTl we have that for all i ^ S{e) it holds 
that Ai(M/^) < |Ai(M-^). 

Thus, by applying Lemma 16.21 with respect to the matrix with ei = \, €2 = e/2, we 
get that for m 2 = 0(1), it follows that > 1 — e/2. Thus it follows that 


wjXw/- = 


> 


d 


^Ai(wjui)^> ^ Ai(wjui)^ > (Ai - e/2) ^ (wjuj)^ 

*=1 i£S{e) i£S{e) 


(Ai-e/2)(l-e/2) > Ai-e. 


Since on every iteration s of the repeat-until loop it holds that A^^) — Ai = D(e), similarly 
to Lemma we have that the optimization oracle A is applied to solving 0 (l)-smooth and 
D(e)-strongly convex problems. Hence, by the SVRG theorem. Theorem 15.11 each invocation of 
A when implemented using the SVRG algorithm, requires O (^ + time. Hence, the theorem 
follows. □ 


20 









Theorem Ol follows from Theorem ED and the observation that by using a standard con¬ 
centration argument for matrices H, instead of applying the algorithm directly to the matrix 
X = i it suffices to apply it to the matrix X = ^ ^*^0 where n' = 0(e“^) and 

the vectors are sampled uniformly from Hence, we can basically substitute N 

with d/e^ in the running time stated in Theorem 16.11 From this observation it also follows that 
Theorem 11.31 holds also when X is not given explicitly as a finite sum of rank one matrices, but 
X = Ex ..^X)[xx''~] for some unknown distribution T) from which we can sample in 0{d) time. 

Theorem [Til follows if we replace the use of the SVRG algorithm in the proof of Theorem 
IQ with its accelerated version, i.e. use Theorem 15.21 instead of Theorem EH 

7 Dense PCA and Acceleration of Algorithms for Semidefinite 
Optimization 

So far we have considered the problem of computing an approximation for the largest eigenvector 
of a matrix X given as a sum of rank-one matrices, i.e. X = However, our 

techniques extend well beyond this case. In fact, the only place in which we have used our 
structural assumption on X is in our implementation of SVRG for PCA given in Algorithm |T] 
and the corresponding runtime analysis. 

In this section we consider the following natural generalization of the PCA problem which 
we refer to as dense PCA: we assume X is of the form 

n 

X = '^piAi, 
i=l 

where = 1, Vi G \n]pi > 0 and Vi € [n] Aj is a symmetric real dx d matrix. We further 

assume that Vi G [n] ||Aj|| < 1 and X ^ 0. We focus here on the problem of fast approximation 
of Ai(X), i.e. finding a unit vector w such that 

w^Xw > Ai(X) — e. 

Attaining such an approximated eignevector for X, as considered in this section ,is an 
important subroutine in several algorithms for Semidefinite Programming [a HU 0 i and 
Online Learning PttSKIlE]. 

For the purposes of this section we denote by N the total number of non-zeros in all matrices 
Ai,..., A„, and in addition we denote by S the maximal number of non-zero entries in any of 
the matrices Ai,..., A„. 


7.1 SVRG for Dense PCA 

As discussed above, our fast PCA algorithms and corresponding analysis could be directly 
applied to the dense case under consideration in this section. The only thing that needs to be 
changed is the application of the SVRG algorithm. A modified SVRG algorithm for the dense 
PCA problem is detailed below. Specifically, we apply SVRG to the following optimization 
problem: 


n 

min{Fw,A(z) := (z"^(AI 

2 = 1 


Aj)z — w'^z 


®See for instance the Matrix Hoeffding concentration inequality in |26| . 


(16) 
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Algorithm 6 SVRG for Dense PC A 


1 

Input: A G M, X 

= EiLiPi^o w, r],m,T. 

2 

zo 0 


3 

for s = 1, ...T do 

4 

Z ^ Zs-l 


5 

{\1- X)i 

1 - Wt_i 

6 

Zo ^ z 


7 

for t = 1,2 ,... 

m do 

8 

Pick it G [n] 

according to probability distribution {pi,P 2 , ■■■,Pn) 

9 

Zt ^ Zt-l - 

T] {{XI - Aifi{zt-i -i)+p) 

10 

end for 


11 

^2 _L 

^ m Z^t=0 

Zt 

12 

end for 


13 

return zj- 



Note that here, as opposed to the standard PCA problem, we assume that X is given by a 
weighted average of matrices and not necessarily a uniform average. This difference comes into 
play in Algorithm [6] when we sample a random gradient index it according to the weights {pi}2=i, 
and not uniformly as in Algorithm 01 This change however does not change the analysis of the 
algorithm given in Theorem lB.il and thus we have the following theorem which is analogues to 
Theorem EH 


Theorem 7.1. Fix e > 0,p > 0. There exists a choice ofrj,m such that Algorithmic finds with 
probability at least 1 — p an e-approxiated minimizer of (USD in overall time 


6 



dFS \ 

(A-Ai(X))2;- 


By Plugging Theorem 17.11 into the proof of Theorem ll.3l instead of Theorem 15.ip we arrive 
at the following theorem (analogues to Theorem 11.31) . 


Theorem 7.2. Fix e > 0,p > 0. There exists an algorithm that finds with probability at least 
1 — p a unit vector w such that w'''Xw > Ai — e, in total time O . 


7.2 Faster sublinear-time SDP algorithm 

Here we detail a specihc application of Theorem 17.21 to accelerate the sublinear-time approx¬ 
imation algorithm for Semidefinite Programming recently proposed by Garber and Hazan [8]. 
Towards this end we consider the following semidefinite optimization problem: 

max min W»Ai-bi, (17) 

W; WbO,Trace{W)=l *e[n] 

where we assume that Vi G [n]: Aj is symmetric, ||Aj|| < 1, HAjHj? < F for some F, and 
\bi\ < 1. Note that under the assumption ||Aj|| < 1 it holds that F < '/d. We use A • B to 
denote the dot product J2ij 

As before we denote by N the total number of non-zero entries the matrices Ai,..., A„ and 
by S the maximal number of non-zero in any single matrix. 

To the best of our knowledge the algorithm in [8] is the current state-of-the-art for approx¬ 
imating Problem (11711 for a wide regime of the input parameters d, n, e. 
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Theorem 7.3 (Theorem 1 in [S]). There exists an algorithm that finds in total time 

a solution Wj such that with probability at least 1/2 it holds that 

min W f • Ai — bi > max min W • A, — 6, — e. 

W: W^O, Trace(W)=l i£[n] 

The algorithm performs roughly iterations where each iteration is comprised of two 
main parts: i) low-variance estimation of the products Aj • ww"'" Vi G [n] for some unit vector 
w G which are used to obtain a probability distribution p G over the functions {/i(W) := 
Aj • W — bi}2^i, and ii) an 0(e) - approximated leading eigenvalue computation of the matrix 
where p is a distribution as discussed above. The term on the right in the running 
time stated in Theorem Ea comes from this approximated eigenvalue computation which is 
done according to either the standard Lanczos method or the Sample Lanczos method detailed 
in Table fT7l.ll By replacing the Sample Lanczos procedure with Theorem 17.21 we arrive at the 
following improved Theorem. 

Theorem 7.4 (Accelerated sublinear SDP solver). There exists an algorithm that finds in total 
time 

a solution Wj such that with probability at least 1/2 it holds that 

min W f • Ai — bi > max min W • Aj — — e. 

iG[n] W: W^O, Trace(W)=l 

A slight technical issue is that the queried matrix X = need not be positive 

semidefinite as we assumed in our results, however under our assumptions we can easily apply 
our results to the matrix X = ^ which is positive semidefinite, which only slightly 

affects the leading constants in our theorems. 
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A Convergence of the Power Method 

We first restate the theorem and then prove it. 

Theorem A.l. Let Ni be a positive definite matrix and denote its eigenvalues in descending 

order by Ai, A2,..., A^, and let ui,U 2 ,...,Ud denote the corresponding eigenveetors. Denote 6 = 

Ai — A 2 and k = ^. Fix an error toleranee e > 0 and failure probability p > 0. Define: 



Then, with probability 1 — p it holds that 


1. (crude regime) Vt > Tcrnde(^’7')- > (1 — e)Ai. 


crude 


2 . (accurate regime) Vt > T)(^{K,e,p): (w^ui)^ > 1 — e- 
In both cases, the success probability depends only on the random variable (w([ui)^. 
Proof. By the update rule of Algorithm (H it holds for all i G [d] that 



2 


((M*wo)'^Ui)^ _ _ (A*W([uj)^ 

||M*WoP W([M2iwo Ej=i (Wq Uj)^ 




(18) 


Since wq is a random unit vector, according to Lemma 5 in [3], it holds that with probability 
at least 1 — p, (wq ui )2 > g. Thus we have that with probability at least 1 — p it holds for all 
i G [d] that 



(19) 
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Given e G (0,1), define 5(e) = {i G [d] | A* > (1 — e)Ai}. Fix now ei, 62 G (0,1) and define 

r(ei,e2,p) := 

2ei Vp £ 2/ 

According to Eq. m, with probability at least 1 — p we have that for t > T(ei,e2,p), 
for all i ^ 5(ei) it holds that (w^Uj)^ < e2(wg'uj)^, and thus in particular it holds that 

Eie5(ei)(w7ui)2 > l-e2. 

Part one of the theorem now follows by noticing that according to the above, by setting 
ei = €2 = e/2 we have that with probability at least 1 — p, for t > Tcr^dei^^P) ~ T{el2,el2,p), 
it holds that 

d 

wJMwj = ^ Ai(w7ui)^ > ^ (1 - e/2)Ai(w7ui)^ > (1 - e/2)^Ai > (1 - e)Ai. 

*=1 ieS(e/2) 

For the second part of the theorem, note that 5((5/Ai) = {!}. Thus we have that with 
probability at least 1—p, for all t > 5, €,p) = T{6/Xi, e,p), it holds that (w7ui) > 1 —e. 

□ 

B SVRG for Convex Functions given by Sums of Non-convex 
Functions and its Acceleration 

Suppose we want to minimize a function F{x) that admits the following structure 

1 " 

^(x) = - (20) 

i=l 


where each /j is j3 smooth, i.e. 

||V/i(x) - V/i(y)|| < /3||x-y|| Vx,y G M'^, 


and E(x) is u-strongly convex, i.e. 

F(y) <F(x) + (y-x)TVF(x) + |||x-y|7 Vx,y gM'". 


Algorithm 7 SVRG 
1: Input: XQ,p,m 
2: for s = 1, 2,... do 

3: X Xs_i 

4: /i •(— VF(x) 

5: Xq ■(— X 

6: for t = 1, 2,..., m do 

7: Randomly pick it G [n] 

8: Xt ^ Xf_i - p {Vfi^{xt-i) - V4(x) +/2) 

9: end for 

10: X, ^ - Ylt=0 Xt 

11: end for 
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Theorem B.l. Suppose that each function /i(x) in the objective (I2ni) is (3-smooth and that 
F(x) is a-strongly convex. Then for rj = and m > holds that 

E[||i,-x*f] <2-^||io-x*f. 


Proof. We begin by analyzing the reduction in error on a single epoch s and then apply the 
result recursively. Let us fix an iteration t G [m] of the inner loop in epoch s. In the sequel we 
denote by Et[-] the expectation with respect to the random choice of it (i.e., the expectation is 
conditioned on all randomness introduced up to the tth iteration of the inner loop during epoch 
s). Define 


Vi = V4(xi_i) - V/i4(x) + fi. 


Note that Ei[vi] = VF(xt_i) and thus vt is an unbiased estimator for VF(xi_i). We continue 
to upper bound the variance of vt in terms of the distance of Xi_i and x from x*. 

Ei[||vif] < 2Ei[||V4(xi_i) - V4(x*)f] + 2Ei[||V/,,(x) - V/,,(x*) - VF(x)f] 

= 2Ei[||V4(xi_i) - V4 (x*)|| 2] + 2Ei[||V4(x) - V/,,(x*)f] 

- 4VE(x)^(VF(i) - VF(x*)) + 2||VF(x)f 

< 2Ei[||V4(xi_i) - V4 (x*)|| 2] + 2Ei[||V/,,(x) - V4(x*)f] 

< 2/32(||xi_i-x*f+ ||i-x*f), 


where the first inequality follows from (a + b)'^ < 2a? + 26^, the first equality follows since 
Ei[/jj(x)] = VE(x) (same goes for x*), the second inequality follows since VE(x*) = 0, and 
the third inequality follows from smoothness of fi^. 

We now have that, 

Ei[||xi-x*|p] = ||xi_i-x*||^-2r/(xi_i-x*)^Ei[vi]+r/^Ei[||vi|p] 

< ||xi_i - x*|p - 2r/(xi_i - x*)’’’VF(xi_i) + 2rf‘0^{\\xt-i - x*||^ + ||x - x*|p) 

< ||xi_i - x*||^ - 2r]a\\xt-i - x*|p + 2Trf0^{\\xt-i - x*|p + ||x - x*|p), 


where the second inequality follow from convexity and strong-convexity of F. 

Thus we have that, 

E[||xt - x*f] - E[||xi_i - x*f] < 277 ( 77 / 3 ^ - cr)E[||xt_i - x*||^] +2if0^F[\\x - x*|p]. 
Summing over all iterations of the inner loop on epoch s we have 


E[||x,n -x*||^] -E[||xo -x*|p] < 277 ( 77 / 3 ^ - a) ^E[||xi_i -x*|p] 277777 ^/3^E[||x - x*|p]. 

7=1 

Rearranging and using xg = x we have that, 

m 

277 ( 0 - - 77 , 5 ^) ^E[||xt - x*|p] < (1 + 277777^/3^)E[I|x - x*|p]. 


7=1 


Using X = Xs_i and x^ = ^ have that. 


TT^ni- *ii2i / 1 + 2777772/32 2i 

Elllx., - X II I < ::-^-^E| x^.i - x 


277777(0- — 77/32 
Plugging the values of 77 , 777 gives the theorem. 


□ 
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B.l Acceleration of Algorithm [3 

We now discuss how using the recent generic acceleration framework of [12] , we can further accel¬ 
erate Algorithm [7| Note that Algorithm [7] requires to compute overall 0{n + m) = O 
gradients of functions from the set Using the framework of [12], this quantity could 

be dramatically reduced. 

On a very high-level, the framework of [12] applies a convex optimization algorithm in an 
almost black-box fashion in order to simulate an algorithm known as the Accelerated Proximal- 
point algorithm. That is, it uses the convex optimization algorithm to find an approximated 
global minimizer of the modified function: 


where k, y are parameters. 

Note that the SVRG algorithm ([7]) could be directly applied to minimize (1211) by considering 
the set of functions /i(x) = /i(x) -|- ^||x — y|p for all i E [re]. It clearly holds that T(x) = 
'Ya=i Not also that now T(x) is a -|- re strongly convex and for each i E [re] it holds that 

/j(x) is /3 -|- re smooth. 

The following Theorem (rephrased for our needs) is proven in [12] (Theorem 3.1). 


Theorem B.2. Fix the parameter re. There exists an acceleration scheme for Algorithm [7] 


that finds an e-approximated minimizer of (EOD, after approximately minimizing O 
instances of m- 


CT+K \ 


Plugging Theorems IB . 11 |B?2] and the proprieties of T’(x) we have that Algorithm [7| could be 
applied to finding an e minimizer of (|20p in total time: 


O 



fi + K 
a + K 



where Tq denotes the time to evaluate the gradient of F and Tg denotes the worst case time to 
evaluate the gradient of a single function /j. 

By optimizing the above bound with respect to re we arrive at the following theorem. 


Theorem 

0{d) time. 

framework 


B.3. Assume that the gradient vector of each function /i(x) could be computed in 
Assume further that (5 = fl{^/TG/Tga). Algorithm^ combined with the acceleration 

ofm, finds an e-approximated minimizer of (EQl) in total time O 


I I3rp3/irpl/i 
-L n 


-G 
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